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ABSTRACT 

The  problem  of  image  segmentation  is  considered  in  the  context  of 
a  mixture  of  probability  distributions.  The  segments  fall  into 
classes.  A  probability  distribution  is  associated  with  each  class  of 
segment.  Parametric  families  of  distributions  are  considered,  a  set  of 
parameter  values  being  associated  with  each  class.  With  each 
observation  is  associated  an  unobservable  label,  indicating  from  which 
class  the  observation  arose.  Segmentation  algorithms  are  obtained  by 
applying  a  method  of  iterated  maximum  likelihood  to  the  resulting 
likelihood  function.  A  numerical  example  is  given.  Choice  of  the 
number  of  classes,  using  Akaike*s  information  criterion  (AIC)  for  model 
identif ication,  is  illustrated, 
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I .  INTRODUCTION 


A  digital  (i.e.,  numerical)  image  may  be  considered  as  a  rectangular 
array  of  picture  elements  (pixels),  indexed  by  Ci,j).  Ac  each  pixel  the  same 
p  features  are  observed.  We  denote  the  features  by 


The  vector  of  features  is 


. 


=  «i-  ^2 . V- 


The  observed  digital  image  is 

X. .  =  (x, . . ,  X. . . ,  . . , ,  X  . . ; 


where 


is  the  vector  of  numerical  values  of  the  p  features  at  pixel  (i,j). 

Examples .  (i)  In  color  television,  p  =  3  colors,  the  pixels  are  the 

dots  on  the  screen,  and  for  pixel  (i,jj,  x  =  red  level,  x.. .  .  = 

1  i  J  *•  i  J 

green  level,  and  "  blue  level.  (ii)  In  LANDSAT  data,  p=4 

spectral  channels,  one  in  the  green/yellow  visible  range,  the  second  in  the 
red  visible  range,  and  the  other  two  in  the  near  infrared  range. 

An  object  is  a  set  of  contiguous  pixels  which  may  be  assumed  to 

be  members  of  a  common  class.  One  task  of  image  processing  is 

segmentation,  grouping  of  pixels  with  a  view  toward  identifying 

objects . 

In  this  context  the  conceptual  model  is  that  the  image  is  a  set  of 
pixels,  and,  also,  the  image  consists  of  several  segments.  Each  pixel 
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belongs  to  one  and  only  one  segment.  The  segments  fall  into  several 

classes.  For  example,  in  a  picture  of  a  house  the  classes  might  be 

brick,  sky,  grass,  shadow  and  brush.  Note  that  there  might  be  several 

separate  areas  of,  say,  grass.  Each  of  these  areas  is  a  segment, 

but  they  all  belong  to  the  class,  **grass.” 

The  statistical  model  accompanying  this  conceptual  model  is  as  follows: 

--With  each  class  of  segment  is  associated  a  probability 
distribution  for  the  feature  vector  X; 

--With  each  pixel  is  associated  a  label  which,  were  it  known  to  us, 
would  tell  us  which  class  of  segment  the  pixel  belongs  to. 

Each  pixel  thus  gives  rise  to  a  pair  (X,y),  where  X  is  observable  and  Z  is 

rot.  In  the  context  of  this  statistical  model  segmentation  is 

estimation  of  the  set  of  labels. 

The  number  of  classes  will  be  denoted  by  k.  The  algorithms  developed 
here  try  one  value  of  k  at  a  time.  Methods  of  comparing  the  results  for 
different  values  of  k  will  be  discussed. 

Often  one  considers  parametric  models,  in  which  the  class-conditional 
probability  functions  f(xlc)  are  assumed  known,  except  possibly  for 
the  values  of  distributional  parameters.  That  is, 

fCxjc)  =  hi,x;B  ■) , 

where  i  is  the  parameter.  E.g.,  in  the  multivariate  Gaussian  case 

9  consists  of  the  mean  and  covariance  matrix  for  class  c.  The 
— c 

parameters  are  usually  unknown.  However,  image  processing  is  usually 
done  in  a  context  where  there  is  prior  information  about  the  parameters. 

This  can  provide  initial  estimates  for  an  iterative  estimation  algorithm. 

We  shall  write  rather  than  x^ . ,  using  a  single  subscript  t  rather 


than  the  double  subscript  ij  for  the  pixels,  even  though  they  are  in  a 
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two-dimensional  array. 

The  label  associated  with  the  t-th  pixel  will  be  denoted  by 
t  =  1,2,...,  n  ==  IJ.  The  label  is  equal  to  c  if  and  only  if 

pixel  t  belongs  to  class  c.  It  is  convenient  to  represent  the  information 
carried  by  the  label  in  a  k-dimensional  vector  0^  which  consists 

of  k-1  zeros  and  a  single  1,  the  position  of  the  1  indicating  which  segment 
pixel  t  belongs  to;  i.e.,  0  has  a  1  as  its  ^  -th  element  and  O’s 

t  t 

elsewhere.  The  probability  density  function  (p.d.f.)  of  ,  given  6[^, 
is 

^  «ct  » 

where  the  summation  is  for  c  =  l,2,...,k,  and  9^^  is  the  c-th 
element  of  0  . 

II.  THE  PROBABILITY  MODEL 

It  is  assumed  that  the  X's  are  conditi'  -»ily  independent,  given  the  2f’s. 
(More  complicated  models  are  under  study.)  Then  their  joint  p  d.f.  is 
the  product  over  t  =  of  factors  (1.1). 

Note  that,  if  X. ,  , . . . ,  X  are  independent  and  identically  distributed 

with  a  standard  mixture  density 

ffx)  *  I  f(x!c)ir  , 

^  c  c 

where  the  summation  is  for  c  -  l,2,...,k  and  the  s\im  of  the  class  proba¬ 
bilities  ir  is  1 ,  then  (1.1)  gives  the  conditional  deiisity  of  the  X's, 

given  their  labels.  It  is  for  this  reason  that  the  model  used  here  is  called 
the  conditional  population-uiixture  model.  The  standard  mixture  model  has 
been  used  for  pixel  classification;  see,  e.g.,  [7].  Further  discussion  of 
the  conditional  model,  in  'he  conte.xr  of  statistical  cluster  analysis,  and 
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further  references  are  given  in  [10]. 

A  likelihood  approach  is  illuminating  in  that  it  can  show  how 
ad  hoc  optimality  criteria  (objective  functions)  which  have  been  proposed 
relate  to  likelihood  function  in  particular  probability  models. 

Note  that  (1.1)  can  be  written  as  a  product 

®ct 

fUJc)  (2.1) 

where  the  product  is  over  c  =  l,2,...,k.  This  form  is  often  more  convenient, 
and  we  shall  use  it  in  what  follows. 

III.  THE  SEGMENTATION  ALGORITHM 

Using  (2.1)  and  the  conditional  independence  assumption,  one  sees  that 
the  joint  p.d.f.  of  the  ,  given  the  6^,  is 

This  likelihood  is  to  be  maximized  over  all  assignments  of  pixels  to  classes 
and  over  all  permissible  parameter  values.  Many  ^  hoc  schemes  can  be 
applied  to  this  maximization  problem.  E.g.,  one  way  to  maximize  is  to  start 
with  a  given  segmentation,  take  each  observation  successively  and  shift 
it  to  the  first  segment  for  which  a  shift  results  in  an  increase  in 
likelihood,  and  loop  through  the  data  until  no  pixel  changes  classes. 

The  algorithm  developed  here  is  an  iterative,  back-and-forth  procedure. 
We  first  maximize  with  respect  to  (w.r.t.)  the  9*s  (holding  the  P's 
fixed  at  initial  values),  then  w.r.t.  the  B's  (holding  the  6's  fixed 
at  the  values  obtained  in  the  previous  stage),  then  again  w.r.t.  the  0*s 
(holding  the  S's  fixed  at  the  values  obtained  in  the  previous  stage), 
etc.  We  stop  when  no  6  changes,  i.e.,  when  no  pixel  changes  classes,  or 
when  a  specified  amount  of  computer  time  is  used  or  a  specified  number  of 


Sclove:  Application  of  the  Conditional  Population-Mixture  Model  to 
Image  Segmentation 


iterations  has  been  performed. 

An  alternative  way  of  starting  the  procedure  would  be  to  start  with 

an  initial  segmentation  rather  than  with  initial  guesses  of  the  B*s. 

It  is  clear  that,  for  fixed  values  of  the  say  b's,  the 

likelihood  is  maximized,  for  each  t,  by  taking  the  estimate  T  of  0 

ct  ct 


to  be 


T  = 
*Ct 


1  if  h(x  ;b  )  =  max  {hfx 


0  otherwise  . 


(3.1) 


(In  case  of  ties  an  arbitrary  choice  is  made;  e.g.,  the  observation  is 
assigned  to  the  tieing  class  with  smallest  subscript.)  In  other  words, 
segmentation  proceeds  by  allocating  pixel  t  to  that  class  c  for  which  the 
estimated- probability  density  of  the  observation  x  is  largest. 

Note  that,  having  tentatively  estimated  the  6*s  at  any  stage,  i.e,, 
having  tentatively  segmented  the  image,  estimation  of  the  ^*s  is  reduced 
simply  to  ordinary  maximum  likelihood  estimation  in  the  particular  parametric 
family  at  hand.  This  is  a  special  advantage  of  this  approach. 

Let  0  denote  the  set  of  0*s  and  B  the  sec  of  B’s.  Let 

L(B,  0),  or  simply  L  for  short,  denote  the  likelihood.  Let 
denote  the  value  of  B  which  maximizes  L  at  the  s-th  stage  of  the 

( s ) 

iteration,  and  let  0  denote  the  value  of  0  which  maximizes  L 
at  the  s-th  stage  of  the  iteration.  Then  0^^^  maximizes  L(B^^\  0) 
v.r.t.  0,  and  maximizes  L'E,0^^  w.r.t.  B.  This 

back -and- forth  maximization  is  an  example  of  the  relaxation  method  (Southwell ^ s 
method);  see  [7,  pp.  241ff.)  and  (10,11).  It  is  true  that 
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That  is,  at  nc  stage  of  the  procedure  can  the  value  of  the  likelihood 
decrease;  however,  there  is  no  guarantee  of  convergence  to  the  global  maximum 
(neither  do  alternative  clustering  algorithms  guarantee  convergence  to  the 
global  maximum  of  their  objective  functions).  To  see  how  the  procedure 
can  fail  to  converge  to  a  global  maximum,  suppose  it  happens  that 

^  for  all  B, 

or 

LCB*'®^  '01  for  all  ©., 

Then  the  procedure  will  terminate  at  the  s-th  stage,  without  having 
necessarily  reached  the  global  maximum.  That  is,  if,  having  maximized 
w’.r.t.  one  of  the  variables  ^  or  0,  we  happen  to  find  ourselves  at  a 
(relative)  maximum  w.r.t.  the  other,  we  may  not  reach  a  global  maximum. 

In  other  words,  the  procedure  could  conceivably  stop  at  a  multidimensional 
saddle  point. 


IV.  APPLICATION  TO  PARTICULAR  DISTRIBUTIONS 
Now  W'e  consider  application  of  this  general  method  to  particular 
families  of  distributions.  First  we  consider  normal  distributions 
with  common  covariance  matrix,  for  in  this  case  it  becomes  clear  how 
the  model  of  the  present  paper  establishes  a  link  with  some  existing 
clustering  procedures. 

A.  Multivariate  Normal  Distributions  with  Common  Covariance  Matrix 

In  the  case  of  normal  distributions  with  means  ^  *  l,2,...,k. 


and  common  cov*ariance  matrix  I,  the  likelihood  takes  the  form 


where  the  quadratic  form  q  is  given  by 
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q(x;vi,Z)  =  (x  -  y) '  I  (x  -  y) , 

^  ^  ^  <1^  <1^  «i^ 

where  ’  denotes  vector  transpose.  This  quadratic  form  is  the  squared 
CMahalanobis')  distance  between  x  and  u  in  the  metric  of  I.  Here 


(3.1)  is  equivalent  to 


ct 


1  if  q(x^;ni^,S)  =  min^{q(x^;m^,S)} 


0  otherwise, 


(^.1) 


where  m  and  S  are,  respectively,  the  estimates  of  ji  and 

Z,  That  is,  pixel  c  is  assigned  to  that  group  to  whose  tentatively 
estimated  mean  vector  it  is  closest,  where  the  distance  is  in  the  metric  of 
the  tentatively  estimated  covariance  matrix.  Having  estimated  the  0,’s, 
we  have  multivariate  normal  observations  arranged  into  groups;  maximization 
w.r.t.  the  ji^s  and  1  is  accomplished  by  taking  the  group  mean 
vectors  as  estimates  of  the  and  the  within-groups  sum-of-products 

matrix  gives  the  estimate  of  Z.  The  procedure  is  iterated:  using 
new  estimates  m^,  c  =  1,2,. ..,k,  and  S,  the  rule  (4.1)  is  applied 

again.  Then  new  m*s  and  a  new^  S  are  calculated;  etc.  The  Mahalanobis 
distances  can  be  computed  efficiently;  see,  e.g.,  [1,  p.  107]. 

Relation  to  the  isodata  procedure:  This  scheme  is  a  Mahalanobis 


distance  version  of  isodata  [4],  Isodata  proceeds  as  follows.  One 
starts  with  tentative  estimates  of  cluster  means  and  assigns  each 
individual  to  the  mean  to  which  he  is  closest.  ( Isodata  uses  Euclidean 

distance,  or  modified  Euclidean  distance  in  which  different  w^eights  are 
assigned  to  the  p  dimensions.)  The  cluster  means  are  then  re-estimated, 
and  one  loops  through  the  data  again,  reassigning  the  individuals,  etc.  Note 
the  similarity  to  our  scheme:  We  start  with  tentative  estimates  of  the  means 
and  covaraiance  matri.x  and  assign  each  individual  to  the  mean  to  which  he  is 
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closest,  using  Mahalanobis  distance  in  the  metric  of  the  tentatively 
estimated  covariance  matrix.  The  means  and  covariance  matrix  are  then 
re-estimated,  the  individuals  (pixels)  are  re-allocated  to  clusters  (segment 
classes) ,  etc. 

An  important  difference  is  that  our  scheme  employs  Mahalanobis 
distance  rather  than  Euclidean  or  weighted-Euclidean  distance.  It  is 
worth  emphasizing  that  it  is  the  Mahalanobis  distance  based  on  the 
with in -groups  sum-of-products  matrix  that  arises  here;  some  data 
analysts  use  the  total  sum-of-products  matrix,  which  is  not 
appropriate;  see,  e.g. ,  [5].  Thus,  if  one  wants  to  achieve  use  of  a 
proper  metric  by  making  a  linear  transformation  of  the  data,  this  would 
have  to  be  done  at  the  beginning  of  each  iteration,  making  the 
appropriate  transformation  based  on  the  covariance  matrix  estimate 
obtained  at  the  previous  iteration. 

Relation  to  the  k-means  procedure:  Arranging  the  computation 
differently,  updating  the  estimates  of  the  means  and  covariance  matrix 
after  each  individual  pixel  is  assigned  rather  than  waiting  until  all 
have  been  assigned,  produces  a  Mahalanobis -distance  version  of  the  k-means 
procedure  [8]. 

A  numerical  example:  As  a  sample  ’* image”  the  Fisher  iris  data  were 
used.  This  dataset  consists  of  4  features  measured  on  150  flowers,  50  in 
each  of  three  species.  To  form  a  digital  image  the  150  flowers  were  arranged 
into  a  15  X  10  rectangular  array,  rows  1-5  being  species  1,  rows  6-10  being 
species  2,  rows  11-15  being  species  3.  This  means  that  the  true  segmentation 
is  as  follows.  (Note  that,  although  these  data  are  arranged  in  a  rectangular 
array,  no  use  was  made  of  the  spatial  information.  Paper  [11]  is  a 
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preliminary  report  of  the  development  of  algorithms  incorporating  spatial  and 
contextual  information. ) 
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Below  are  given  results  obtained  by  starting  with  initial  means  equal 
to  the  measurements  on  flowers  50,  100  and  150.  (These  are  easy  for 
the  algorithm  in  the  sense  that  they  are  in  fact  from  the  three 
different  species,  but  not  so  easy  as,  e.g. ,  flowers  1,  51  and  101, 
which  are  further  apart.  Starting  with  means  that  are  from  correct 
classes  is  analogous  to  applications  where  something  is  known  about  the 
characteristics  of  the  classes.)  The  results  in  successive  iterations 
were  as  follows.  Convergence  was  reached  on  the  fourth  iteration,  i.e., 
on  the  fifth  iteration  no  pixel  changed  class.  The  execution  time  was 


i 


S.SI  sec.  on  an  IBM  4341. 
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All  computations  reported  here  were  carried  out  using  FORTRAN  computer 
programs  written  by  the  author.  These  programs  have  been  sent  to  the 
Statistics  Program  at  the  Office  of  Naval  Research  for  deposit  in  the 


Naval  Research  Laboratories. 


12 


Sclove:  Application  of  the  Conditional  Population-Mixture  Model  to 
Image  Segmentation 

B.  Multivariate  Normal  Distributions  with  Different  Covariance  Matrices 

The  algorithm  generated  for  this  case  turns  out  not  to  be  simply  to  use 
a  different  Mahalanobis  distance  for  each  cluster,  (The  complication 
which  occurs  is  analogous  to  that  in  classification,  i.e.,  discriminant 
analysis,  where  one  is  led  to  quadratic  discriminant  functions  if  the 
covariance  matrices  differ.)  The  likelihood  is 

**  0  /  2 

Equation  (3.1)  becomes 

“1/2 

1  if  setting  d=c  maximizes  | Z  ,  |  exp[ -qCx^ ;U  , , Z  ,)/2] 

^ct  =  (^-2) 

0  otherwise  . 

Maximizing  the  expression  in  (4.2)  is  equivalent  to  minimizing 
logJZdl  +  q^it^id’ld^ 

This  involves  not  only  the  Mahalanobis  distance  between  the  observation  and 
the  mean  of  the  given  class  but  also  the  logarithm  of  the  determinant  of  the 
covariance  matrix  for  the  given  class. 

It  has  been  noted  (see,  e.g. ,  [6])  that  in  the  standard  mixture 
model  for  this  case  the  supremum  of  the  likelihood  is  infinity.  This 
is  reflected  in  the  fact  that  in  our  algorithm  it  would  be  possible  that  at 
some  “stage  one  of  the  classes  would  consist  of  a  single  pixel,  so  that  the 
tentative  estimate  of  the  mean  of  that  group  would  be  the  feature  vector  for 
that  pixel,  and  the  tentative  estimate  of  the  covariance  matrix  of  that 
cluster  would  be  undefined. 

Numerical  example,  continued:  Results  similar  to  those  for  the  case 
of  common  covariance  matrix  were  obtained  using  the  algorithm  for  this  case, 
with  the  adjustment  for  determinants  of  the  covariance  matrices.  However, 
when  these  adjustments  were  omitted,  and  the  clustering  was  performed 
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using  only  the  Mahalanobis  distances,  without  adding  the  logarithm 
of  the  determinant  of  the  covariance  matrix,  the  results  were  poor. 

Fifty  flowers  were  correctly  assigned  to  class  1,  but  only  6  were  assigned 
to  class  2,  the  remaining  94  being  assigned  to  class  3.  Also,  it  took 
twelve  iterations  and  27  sec.*s  of  CPU  time  to  obtain  this  poor  result. 


V.  COMPARISON  WITH  THE  METHOD  BASED  ON  THE  STANDARD  MIXTURE  MODEL 
Clustering  based  on  the  standard  mixture  model  was  considered  in  [14]. 
Under  that  model  the  posterior  probability  that  Individual  t  belongs  to 
Class  c  is 

TT  h(x  ;6  )/Z,  ir,  h(x  ;B,)  .  (5.1) 

c  ^t  ^c  a  a  «^t 

Individual  t  is  assigned  to  that  class  c  for  which  the  estimate  of  (5.1)  is 
largest,  i.e. ,  to  that  class  for  which  the  estimated  posterior  probability  of 
membership  is  largest.  On  the  other  hand,  with  the  conditional  mixture 
model,  Indiv^idual  t  is  assigned  to  that  class  c  for  which  the  estimate  of  the 
density  largest. 

Wolfe  [14]  has  provided  computer  programs  for  the  standard  mixture  model 
in  the  case  of  normal  distributions.  As  is  well  known,  the  likelihood 


equations  for  mixture  problems  are  relatively  complicated.  In  [14]  they  are 
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solved  by  a  multivariate  Newton*Raphson  iterative  method.  This  involves  the 
assignment  of  arbitrary  initial  values  to  start  the  iterative  solution,  as 

does  the  method  described  here. 

VI.  SOME  REMARKS  ON  STATISTICAL  INFERENCE 

The  maximum  likelihood  estimate  of  (B,0;  is  the  value  (b.t)  for  which 
the  likelihood  L  is  largest.  The  quantity  L(b,t)  is  the  corresponding 
maximum  value  of  the  likelihood.  To  approximate  (t>>0  one  uses  the 
algorithm.  Let  A(B,0)  =  L(B,0)/L(b,t) .  Let  F  denote  the  large 
sample  cumulative  distribution  function  (c.d.f.)  of  -2  log^  A, 

i.e.  , 

lim  ^  ?r[-2  log  A(B,0)  <  x]  =  F(x). 

Suppose  that  F  is  independent  of  the  true  values  (B,0).  E.g.,  it  may  be 

the  c.d.f.  of  a  chi-square  distribution  with  an  appropriate  number  of 
degrees  of  freedom;  it  is  necessary  to  investigate  the  extent  to  which 
the  large  sample  theory  of  the  generalized  likelihood  ratio  applies  when 
there  are  incidental  parameters  (such  as  the  labels). 

A.  Confidence  Sets 

Let  denote  the  upper  a-th  percentage  point  of  F.  Then 

1-a  =  F(x  )  =  Pr[-2  log  A(B,0)  <  x  ] 

=  Pr[-2  log^L(B,0)  $  \  ^  -  log^L(b,t)] 

so  that 

{rB,0):  -2  log^L(B,0)  S  x  +  2  log 

is  an  approximate  100(l-a)%  confidence  set  for  (B,0).  Denote  by 

the  estimates  produced  by  the  algorithm.  Then,  since  there 
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is  a  possibility  these  have  not  quite  converged  to  the  maximum  likelihood 
estimates  (b,t),  we  have  L(b\t^)  S  L(b,t).  Thus  a  conservative 
confidence  set  (one  that  contains  more  values  of  (B*0)  than  the  true 
confidence  set  and  thus  has  confidence  coefficient  at  least  l-o  )  is 

-2  log  L(B,0)  <  +  2  log  L (b' , t ’ ) }  . 

B.  Some  Remarks  on  Choice  of  Number  of  Classes 

One  ad  hoc  approach  to  the  choice  of  number  of  classes  is  to 
follow  the  suggestion  in  [8]  of  introducing  refinement  and  coarsening 
parameters  R  and  C  such  that  two  clusters  join  when  their  mean 
vectors  are  less  than  C  units  apart  and  a  cluster  splits  when  its 
diameter  exceeds  R, 

Another  approach  is  to  run  the  algorithm  with  different  choices  of  k  and 
compare  the  results.  Note  that  the  likelihood  function  is  a  different 
function  for  different  values  of  k.  Denote  this  dependence  upon  k  by  writing 
the  likelihood  as  Ii^(B(k)  ,0(k) ) .  Let  ^(k) ,  t(k)  denote  the  maximum  likelihood 

estimates  for  fixed  k.  Following  the  approach  of  [14]  for  the  standard 
mixture  model,  one  might  make  a  sequence  of  hypothesis  tests  to  decide  on  k, 
first  comparing  L^Cb(2)  ,t 1  with  L^rb(2') , t ^3 ) ) , then  if  necessary 

comparing  L.^(b(3)  ,t(3))  with  L.  (b(4)  ,t(4)) ,  etc.  In  [14]  the  asymptotic 

chi-square  distribution  of  the  generalized  likelihood  ratio  is  used;  even  in 
the  context  of  the  standard  mi.xture  model  this  may  not  be  the  correct  asymptotic 
distribution. 

Still  another  approach  is  to  use  Akaike*s  information  criterion  (AIC). 

fSee,  e.g.,  [1,2].^  This  statistic  is 

AlC(k)  =  -2  log  (max  L.  ]  +  2  m(k). 

e 


Here  m(ki  is  the  number  of  independent  parameters  estimated  when  using 
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k  classes.  According  to  this  viewpoint  on  model  selection,  the  best  model 
is  the  one  that  minimizes  AIC.  According  to  AIC,  inclusion  of  an  additional 
parameter  isappropriate  if  log^[max  L]  increases  by  one  unit  or  more, 

i.e.,  if  max  L  increases  by  a  factor  of  e  or  more. 

Numerical  example,  continued:  There  is  some  question  as  to  whether 
the  Fisher  iris  data  should  be  treated  as  two  or  as  three  species  and 
whether  separate  covariance  matrices  should  be  used  for  the  species. 

CSee  [3,  pp.  109-110].;  Accordingly,  we  compare  by  AIC  the  four  models 
resulting  from  taking  k=2  and  3  and  using  common  and  separate  covariance 
matrices.  The  results  were  as  follows. 

Values  of  AIC 


common 

separate 

covariance 

cov^ariance 

k 

matrix 

matrices 

2 

437.9 

293.8 

3 

231,6 

123.3 

For  both  values  of  k,  the  model  with  separate  covariance  matrices  fared 
better,  and  k=3  gave  a  smaller  value  of  AIC  than  did  k=2. 

VII.  DISCUSSION 

A.  Conclusions 

A  probability  framework  for  clustering/segmentation  problems  has  been 
discussed.  A  general  method  of  producing  algorithms  which  correspond  to  a 
method  of  iterated  maximum  likelihood  has  been  given.  The  general  method 
given  here  is  plausible,  is  linked  to  a  probability  model,  and  is  easy  to 
program.  In  the  case  of  multivariate  normal  distributions  with  common 
covariance  matrix  the  general  method  produces  schemes  which  can  be  viewed  as 
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improved  versions  of  some  existing  schemes. 

B .  Remarks 

The  focus  here  has  been  on  the  parametric  case,  but  the  methods 
discussed  might  be  applied  nonparametrically,  by  estimating  the  p.d.f.'s 
f(x|c)  as  segmentation  proceeds,  using  standard  methods  of  density 
estimation. 

Algorithms  based  on  a  likelihood  function  are  based  on  the  raw  data 
matrix,  in  contrast  to  may  clustering  procedures  which  are  based  on  a  matrix 
of  pairwise  similarities  or  distances.  The  latter  procedures  have  the 
advantage  of  applicability  to  problems  where  a  raw  data  matrix  is  not 
available.  When  the  raw  data  are  available,  such  algorithms  have  the 
theoretical  disadvantage  of  not  extracting  all  the  information  from  the 
observations  and  the  computational  disadvantage  of  preliminary  computation  of 
all  the  pairwise  distances. 

C.  Alternative  Models 

The  focus  here  has  been  on  a  model  in  which  the  labels  are  treated  as 
functionally  independent.  In  the  standard  mixture  model  they  become 
random  variables  and  are  treated  as  statistically  independent. 

To  the  assumption  of  Section  I  is  seems  reasonable  to  add: 

--Each  segment  consists  of  more  than  one  pixel. 

As  a  corollary  to  this  assumption,  it  follows  that  the  labels  are 
functionally  related,  in  as  much  as  each  label  must  be  equal  to  one  of  its 
eight  neighbors.  It  would  be  interesting  to  study  the  problem  resulting 
from,  maximizing  the  likelihood  function  under  this  condition. 

Alternatively,  if  the  labels  are  then  treated  as  random,  they  would  be  a 
two-dimensional  Markov  process.  The  author  has  developed  an  algorithm  for 
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estimation  in  this  Markov  model.  Paper  [11]  is  a  preliminary  report  on 
this;  a  more  detailed  report  is  forthcoming. 
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maximum  likelihood  to  the  resulting  likelihood  function.  A  numerical 
example  is  given.  Choice  of  the  number  of  classes,  using  Akaike's 
information  criterion  (AIC)  for  model  identification,  is  illustrated. 
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